Quick Start

Most of the plots are interactive: you can click or zoom to get more details. Clicking on a plot also zooms in automatically.

Loading Packages

library(data.table) # Efficient data frames
library(lubridate) # Date handling
library(plyr) # Data splitting (loaded before tidyverse so it does not mask dplyr functions)
library(tidyverse) # Collection of packages for data wrangling
library(esquisse) # Intuitive plotting
library(ggplot2) # Plotting (already attached by tidyverse)
library(naniar) # NA exploration in data frames
library(sp) # Spatial data
library(plotly) # Make ggplot2 plots interactive
library(gissr) # Spatial transformations
library(leaflet) # Interactive maps
library(leaflet.providers) # Basemap tile providers for leaflet
library(geosphere) # Spatial calculations
library(DT) # Render tables in an explorable UI
library(gridExtra) # Arrange multiple plots at once
library(corrplot) # Correlation plots
library(RColorBrewer) # Color palettes
library(rmdformats) # HTML theme
library(manipulateWidget) # Combine multiple plotly graphs

These packages are required to reproduce the report.

geosphere: spherical trigonometry for geographic applications, that is, computing distances and related measures for angular (longitude/latitude) locations.

gissr: a collection of R functions that make working with spatial data easier.

Ex 3.4

Loading Data and Cleaning

Loading the dataset “LaptopSales_red.csv” provided for the homework.

FALSE Classes 'data.table' and 'data.frame':    148786 obs. of  17 variables:
FALSE  $ V1                    : int  171289 38634 260048 166045 243280 118859 249957 198058 198850 267007 ...
FALSE  $ Date                  : chr  "9/20/2008 2:49" "5/30/2008 9:52" "12/10/2008 9:26" "9/15/2008 9:41" ...
FALSE  $ Configuration         : int  528 307 235 168 517 738 301 301 479 472 ...
FALSE  $ Customer.Postcode     : chr  "NW5 1SP" "N6 6BU" "CR0 2BW" "WC2H 9PS" ...
FALSE  $ Store.Postcode        : chr  "N3 1DH" "N3 1DH" "CR7 8LE" "SW1P 3AU" ...
FALSE  $ Retail.Price          : int  413 515 315 NA 580 535 455 465 600 392 ...
FALSE  $ Screen.Size..Inches.  : int  17 15 15 15 17 17 15 15 17 17 ...
FALSE  $ Battery.Life..Hours.  : int  4 6 5 5 4 6 6 6 4 4 ...
FALSE  $ RAM..GB.              : int  2 1 2 1 2 1 1 1 1 1 ...
FALSE  $ Processor.Speeds..GHz.: num  2.4 2 2.4 2 2.4 2 1.5 1.5 2.4 2.4 ...
FALSE  $ Integrated.Wireless.  : chr  "No" "Yes" "No" "Yes" ...
FALSE  $ HD.Size..GB.          : int  300 80 80 300 120 40 120 120 300 300 ...
FALSE  $ Bundled.Applications. : chr  "No" "Yes" "Yes" "No" ...
FALSE  $ customer.X            : int  528771 528281 532781 530190 537350 532498 533130 529390 533998 532498 ...
FALSE  $ customer.Y            : int  186041 187336 166444 181139 169306 168334 182489 181270 168421 168334 ...
FALSE  $ store.X               : int  525109 525109 532714 529902 528739 528739 534057 528924 528739 532714 ...
FALSE  $ store.Y               : int  190628 190628 168302 179641 173080 173080 179682 178440 173080 168302 ...
FALSE  - attr(*, ".internal.selfref")=<externalptr>

Retail.Price is the only variable with missing values, at a rate of approximately 4.5%.
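The missing-value rate per column can be checked with a short base R sketch. The report itself loads naniar for this; the code below is an assumption about one way to get the same kind of figure, and it only uses the first ten values shown in the str() output above, so the percentage it prints is for that tiny sample, not for the full dataset.

```r
# First ten Retail.Price and Screen.Size..Inches. values, copied from the str() output above
laptops <- data.frame(
  Retail.Price = c(413, 515, 315, NA, 580, 535, 455, 465, 600, 392),
  Screen.Size..Inches. = c(17, 15, 15, 15, 17, 17, 15, 15, 17, 17)
)

# Share of missing values per column, in percent
na_rate <- colMeans(is.na(laptops)) * 100
print(round(na_rate, 1))
```

On the full data frame the same call would report one percentage per variable.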

a. Price Questions

i. At what price are the laptops actually selling?

This histogram shows the most frequent retail prices across all stores in 2008. The median is shown in black.

This boxplot summarizes the retail prices of the 2008 laptop sales dataset around their median; click on the white sphere to see the mean.

## [1] "Last Recorded Prices are 406  and 530  on the same Day with a mean of 468 "

These are the last recorded prices in the 2008 data.

ii. Does price change with time?

These plots show different aggregation levels; which one to use depends on the granularity the analysis needs. Retail prices are generally higher towards the end of the week, and also during the period from May to December.
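The aggregation levels mentioned above can be sketched in base R as follows. The report loads lubridate for date handling, so this is an assumed equivalent rather than the code behind the plots; the four rows are the ones visible in the str() output, and the names `when`, `month` and `weekday` are introduced here for illustration.

```r
# First four Date / Retail.Price pairs from the str() output above
sales <- data.frame(
  Date = c("9/20/2008 2:49", "5/30/2008 9:52", "12/10/2008 9:26", "9/15/2008 9:41"),
  Retail.Price = c(413, 515, 315, NA)
)

# Parse the month/day/year timestamps
sales$when <- as.POSIXct(sales$Date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Derive the grouping keys: calendar month and ISO weekday (1 = Monday, 7 = Sunday)
sales$month <- as.integer(format(sales$when, "%m"))
sales$weekday <- as.integer(format(sales$when, "%u"))

# Mean price per month and per weekday; rows with a missing price are dropped
monthly <- aggregate(Retail.Price ~ month, data = sales, FUN = mean)
by_weekday <- aggregate(Retail.Price ~ weekday, data = sales, FUN = mean)
print(monthly)
print(by_weekday)
```

Switching the grouping key (hour, day, week, month) is all it takes to change the granularity of the trend.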

iii. Are prices consistent across retail outlets?

Each boxplot belongs to a specific store, and we can see a common trend across all stores in 2008. We also see that 5 stores tend to have a lower retail price than the others, with a median closer to 465.

Looking at the time series, most stores follow the same trend over time, but not all of them.

iv. How does price change with configuration?

FALSE `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Using a smoother (here a GAM fit), we can see three phases: a rapid increase in price across the low configurations, then a low and nearly constant slope, and finally another increase for the highest configurations.

b. Location Questions

i. Where are the stores and customers located?

Explore each store and customer in London, UK. You can find their exact location by clicking on them. We can see a big cluster of 545 clients/stores in the center of the City of London.

transform_coordinates is a convenient function from gissr (on GitHub) that builds on spTransform from the CRAN package sp, but takes coordinates directly from a data frame and returns a data frame. The spTransform methods provide transformation between datum(s) and conversion between projections (also known as projection and/or re-projection), from one unambiguously specified coordinate reference system (CRS) to another, prior to version 1.5 using PROJ.4 projection arguments.

ii. Which stores are selling the most?

The following bar charts show two ways of analyzing store sales results: by the number of transactions, or by the sales revenue each store generated during 2008. The store SW1P 3AU sold the most units and generated the highest revenue.
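The two measures can be computed per store as in the sketch below. It uses base R and only the first four rows visible in the str() output, so the ranking it prints does not reproduce the full-data result stated above; the object names `transactions` and `revenue` are introduced here for illustration.

```r
# First four Store.Postcode / Retail.Price pairs from the str() output above
sales <- data.frame(
  Store.Postcode = c("N3 1DH", "N3 1DH", "CR7 8LE", "SW1P 3AU"),
  Retail.Price = c(413, 515, 315, NA)
)

# Number of transactions per store
transactions <- table(sales$Store.Postcode)

# Revenue per store, ignoring transactions with a missing price
revenue <- tapply(sales$Retail.Price, sales$Store.Postcode, sum, na.rm = TRUE)

print(sort(transactions, decreasing = TRUE))
print(sort(revenue, decreasing = TRUE))
```

Note that a store with many missing prices can rank high on transactions but low on revenue, so it is worth looking at both.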

iii. How far would customers travel to buy a laptop?

With this plot we can see the distance between customers and stores in terms of latitude and longitude.

iv. How far would customers travel to buy a laptop? - Alternative

distHaversine: The shortest distance between two points (i.e., the 'great-circle distance' or 'as the crow flies'), according to the 'haversine method'. This method assumes a spherical earth, ignoring ellipsoidal effects. The haversine ('half-versed-sine') formula was published by R.W. Sinnott in 1984, although it has been known for much longer. At that time computational precision was lower than today (15 digits precision). With current precision, the spherical law of cosines formula appears to give equally good results down to very small distances.
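To make the haversine method concrete, here is a minimal hand-written version in base R. It is a sketch of the formula, not the geosphere source; the function name `haversine_m` is hypothetical, and the default radius of 6378137 m is chosen to match the default of geosphere's distHaversine, which takes points as longitude/latitude pairs.

```r
# Great-circle distance in meters on a sphere of radius r (haversine formula).
# Inputs are in decimal degrees; all arguments are vectorized.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along a meridian
one_degree <- haversine_m(0, 0, 0, 1)
print(one_degree)
```

Applied to the customer and store coordinates once they are expressed as longitude and latitude, the function returns one distance per transaction.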

Each unique customer can be found here; scroll to the right to see the distance they need to travel to get to their store.

Histogram of the distance between customers and stores, with a median distance of approximately 4203 meters.

c. Revenue Questions

i. How does the sales volume in each store relate to Acell’s revenues?

You can see each store's proportional contribution to revenue in 2008. SW1P 3AU is still the store contributing the most to Acell’s revenues.

ii. How does this relationship depend on the configuration?

We can see that S1P 3AU offers higher configurations while having the smallest share of the company's total revenue. This could be because it sells higher-priced configurations and therefore sells fewer units during the year, to a smaller pool of customers who want a more powerful PC for demanding work.

d. Configuration Questions

i. What are the details of each configuration? How does this relate to price?

Looking at the details of each configuration, we can see that some specs tend to push the price up, such as a larger screen size, more RAM and a longer battery life.

ii. Do all stores sell all configurations?

With these faceted bar plots, you can spot which configurations sell poorly or not at all in each store. S1P 3AU does not sell every configuration.

Ex 4.1

Loading Data and Cleaning

Loading the dataset “Cereals.csv” provided for the homework.

FALSE Classes 'data.table' and 'data.frame':    77 obs. of  16 variables:
FALSE  $ name    : chr  "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
FALSE  $ mfr     : chr  "N" "Q" "K" "K" ...
FALSE  $ type    : chr  "C" "C" "C" "C" ...
FALSE  $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
FALSE  $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
FALSE  $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
FALSE  $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
FALSE  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
FALSE  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
FALSE  $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
FALSE  $ potass  : int  280 135 320 330 NA 70 30 100 125 190 ...
FALSE  $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
FALSE  $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
FALSE  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
FALSE  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
FALSE  $ rating  : num  68.4 34 59.4 93.7 34.4 ...
FALSE  - attr(*, ".internal.selfref")=<externalptr>

We can see that carbo and sugars each have approximately 1.3% missing values, and potass approximately 2.6%.
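The summary further down reports 74 rows rather than 77, which is consistent with the incomplete rows having been dropped. How the report did this is not shown, so the following is only a sketch assuming complete-case deletion, using the first five carbo, sugars and potass values from the str() output above.

```r
# First five rows of three columns, copied from the str() output above
cereals <- data.frame(
  carbo  = c(5, 8, 7, 8, 14),
  sugars = c(6, 8, 5, 0, 8),
  potass = c(280, 135, 320, 330, NA)
)

# Keep only the rows without any missing value
cereals_complete <- cereals[complete.cases(cereals), ]
print(nrow(cereals))
print(nrow(cereals_complete))
```

An alternative would be to keep all rows and pass na.rm = TRUE to each summary function, which uses slightly different rows per variable.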

a.

Ordinal: shelf, rating

Nominal: name, mfr, type

Quantitative/Numerical: calories, protein, fat, sodium, sugars, potass, weight, cups, vitamins, fiber, carbo

b.

Summary

FALSE      name               mfr                type              calories  
FALSE  Length:74          Length:74          Length:74          Min.   : 50  
FALSE  Class :character   Class :character   Class :character   1st Qu.:100  
FALSE  Mode  :character   Mode  :character   Mode  :character   Median :110  
FALSE                                                           Mean   :107  
FALSE                                                           3rd Qu.:110  
FALSE                                                           Max.   :160  
FALSE     protein           fat        sodium          fiber            carbo      
FALSE  Min.   :1.000   Min.   :0   Min.   :  0.0   Min.   : 0.000   Min.   : 5.00  
FALSE  1st Qu.:2.000   1st Qu.:0   1st Qu.:135.0   1st Qu.: 0.250   1st Qu.:12.00  
FALSE  Median :2.500   Median :1   Median :180.0   Median : 2.000   Median :14.50  
FALSE  Mean   :2.514   Mean   :1   Mean   :162.4   Mean   : 2.176   Mean   :14.73  
FALSE  3rd Qu.:3.000   3rd Qu.:1   3rd Qu.:217.5   3rd Qu.: 3.000   3rd Qu.:17.00  
FALSE  Max.   :6.000   Max.   :5   Max.   :320.0   Max.   :14.000   Max.   :23.00  
FALSE      sugars           potass          vitamins          shelf      
FALSE  Min.   : 0.000   Min.   : 15.00   Min.   :  0.00   Min.   :1.000  
FALSE  1st Qu.: 3.000   1st Qu.: 41.25   1st Qu.: 25.00   1st Qu.:1.250  
FALSE  Median : 7.000   Median : 90.00   Median : 25.00   Median :2.000  
FALSE  Mean   : 7.108   Mean   : 98.51   Mean   : 29.05   Mean   :2.216  
FALSE  3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:3.000  
FALSE  Max.   :15.000   Max.   :330.00   Max.   :100.00   Max.   :3.000  
FALSE      weight           cups            rating     
FALSE  Min.   :0.500   Min.   :0.2500   Min.   :18.04  
FALSE  1st Qu.:1.000   1st Qu.:0.6700   1st Qu.:32.45  
FALSE  Median :1.000   Median :0.7500   Median :40.25  
FALSE  Mean   :1.031   Mean   :0.8216   Mean   :42.37  
FALSE  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:50.52  
FALSE  Max.   :1.500   Max.   :1.5000   Max.   :93.70

Standard Deviations

FALSE       name        mfr       type   calories    protein        fat     sodium 
FALSE         NA         NA         NA 19.8438928  1.0758016  1.0068260 82.7697871 
FALSE      fiber      carbo     sugars     potass   vitamins      shelf     weight 
FALSE  2.4233912  3.8916746  4.3591113 70.8786815 22.2943521  0.8320674  0.1534155 
FALSE       cups     rating 
FALSE  0.2357153 14.0337125

c.

Histogram of Quantitative Variables

Standard Deviations

FALSE   calories    protein        fat     sodium     sugars     potass     weight 
FALSE 19.4841191  1.0947897  1.0064726 83.8322952  4.3786564 70.4106360  0.1504768 
FALSE       cups   vitamins      fiber      carbo 
FALSE  0.2327161 22.3425225  2.3833640  3.9073256

i. Which variables have the largest variability?

Based on the histogram grid and the standard deviations above, sodium, potass and vitamins have the largest variability.

ii. Which variables seem skewed?
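The spread figures printed above are on the scale of standard deviations (what sd() returns) rather than standard errors of the mean, which would be smaller by a factor of the square root of the sample size. The sketch below shows the difference on the first ten calories values from the str() output; the variable names are introduced here for illustration.

```r
# First ten calories values, copied from the str() output above
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)

# Standard deviation: spread of the individual observations
sd_calories <- sd(calories)

# Standard error of the mean: uncertainty of the sample mean
se_calories <- sd_calories / sqrt(length(calories))

print(c(sd = sd_calories, se = se_calories))
```

For comparing variability between variables measured in different units, dividing the standard deviation by the mean is another option.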

Potassium, Fiber and Fat seem skewed. Cups could also be.

iii. Are there any values that seem extreme?

Fiber has at least 3 extreme values (2 classes away from the main cluster). Boxplots make it easier to see which points are outliers.

Multiple Boxplots for Outlier Detection
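The points a boxplot draws as outliers can also be listed numerically. The sketch below uses boxplot.stats() from base R, which flags values more than 1.5 times the hinge spread away from the box; the vector `fiber_like` is made up for illustration and is not the cereal data.

```r
# Made-up values with one obvious extreme point (not the real fiber column)
fiber_like <- c(0, 1, 1, 2, 2, 3, 3, 4, 14)

# Values beyond 1.5 * IQR from the hinges, i.e. the points a boxplot would draw individually
outliers <- boxplot.stats(fiber_like)$out
print(outliers)
```

Running the same call on each quantitative column gives the list of candidate outliers per variable.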

d.

We lack sufficient data on hot-type cereals to compare the two types of cereals.

e.

Shelves 1 and 3 are close (both medians around 40-42). We could use a statistical test to compare the three groups and check whether the medians/means really differ. Even without a test, the boxplots for shelves 1 and 3 overlap, which suggests the two groups are similar on average.
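One possible test for the comparison suggested above is the Kruskal-Wallis rank test, which compares the distributions of several groups without assuming normality. This choice is an assumption, not something the report ran, and the ratings below are made up for illustration only.

```r
# Made-up ratings for three shelves (illustration only, not the cereal data)
cereal_demo <- data.frame(
  shelf  = factor(rep(1:3, each = 5)),
  rating = c(40, 42, 45, 38, 50,
             30, 28, 35, 33, 31,
             41, 44, 39, 52, 47)
)

# Kruskal-Wallis test of rating across the three shelves
kw <- kruskal.test(rating ~ shelf, data = cereal_demo)
print(kw$p.value)
```

A small p-value would indicate that at least one shelf differs; pairwise follow-up tests would then be needed to say whether shelves 1 and 3 differ from each other.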

f.

i. Which pair of variables is most strongly correlated?

Correlation Matrix

FALSE             calories     protein           fat        sodium       sugars
FALSE calories  1.00000000  0.03399166  0.5073732397  0.2962474981  0.569120535
FALSE protein   0.03399166  1.00000000  0.2023533963  0.0115588913 -0.286583967
FALSE fat       0.50737324  0.20235340  1.0000000000  0.0008219036  0.287152487
FALSE sodium    0.29624750  0.01155889  0.0008219036  1.0000000000  0.037058961
FALSE sugars    0.56912054 -0.28658397  0.2871524866  0.0370589612  1.000000000
FALSE potass   -0.07136125  0.57874284  0.1996367171 -0.0394380876  0.001413982
FALSE weight    0.69645215  0.23067141  0.2217141647  0.3125335701  0.460547135
FALSE cups      0.08919615 -0.24209861 -0.1575787041  0.1195841083 -0.032436100
FALSE vitamins  0.25984556  0.05479952 -0.0305139099  0.3315759640  0.072954382
FALSE fiber    -0.29521183  0.51400610  0.0140358654 -0.0707349230 -0.150948502
FALSE carbo     0.27060605 -0.03674326 -0.2849336855  0.3284091857 -0.452069189
FALSE                potass     weight        cups    vitamins       fiber
FALSE calories -0.071361247  0.6964521  0.08919615  0.25984556 -0.29521183
FALSE protein   0.578742837  0.2306714 -0.24209861  0.05479952  0.51400610
FALSE fat       0.199636717  0.2217142 -0.15757870 -0.03051391  0.01403587
FALSE sodium   -0.039438088  0.3125336  0.11958411  0.33157596 -0.07073492
FALSE sugars    0.001413982  0.4605471 -0.03243610  0.07295438 -0.15094850
FALSE potass    1.000000000  0.4205615 -0.50168832 -0.00263583  0.91150392
FALSE weight    0.420561534  1.0000000 -0.20171465  0.32043480  0.24629218
FALSE cups     -0.501688318 -0.2017146  1.00000000  0.13362965 -0.51369716
FALSE vitamins -0.002635830  0.3204348  0.13362965  1.00000000 -0.03871734
FALSE fiber     0.911503921  0.2462922 -0.51369716 -0.03871734  1.00000000
FALSE carbo    -0.365002934  0.1448053  0.35828371  0.25357897 -0.37908370
FALSE                carbo
FALSE calories  0.27060605
FALSE protein  -0.03674326
FALSE fat      -0.28493369
FALSE sodium    0.32840919
FALSE sugars   -0.45206919
FALSE potass   -0.36500293
FALSE weight    0.14480528
FALSE cups      0.35828371
FALSE vitamins  0.25357897
FALSE fiber    -0.37908370
FALSE carbo     1.00000000

Fiber and potass are the most strongly correlated pair, with a correlation of about 0.91.
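Rather than reading the matrix by eye, the strongest pair can be extracted programmatically. The sketch below uses only the first ten fiber, potass and carbo values from the str() output, so its coefficients differ from the full matrix above; the object names are introduced here for illustration.

```r
# First ten rows of three columns, copied from the str() output above
cereals <- data.frame(
  fiber  = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  potass = c(280, 135, 320, 330, NA, 70, 30, 100, 125, 190),
  carbo  = c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13)
)

# Correlation matrix on the rows without missing values
r <- cor(cereals, use = "complete.obs")

# Blank the diagonal and upper triangle so each pair is considered once
r[upper.tri(r, diag = TRUE)] <- NA

# Locate the entry with the largest absolute correlation
idx <- which(abs(r) == max(abs(r), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(r)[idx[1, "row"]], colnames(r)[idx[1, "col"]])
print(strongest)
```

Taking the absolute value matters, since a strong negative correlation is just as informative as a strong positive one.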

ii. How can we reduce the number of variables based on these correlations?

We could remove one variable from each highly correlated pair, because of the threat of multicollinearity. In the context of a regression, computing the VIF on the model would suggest which explanatory variables to remove based on these correlations.
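The variance inflation factor of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the others; equivalently, the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses that identity in base R on the first ten fiber, potass and carbo values from the str() output; treating these three columns as the predictors is an assumption made for illustration.

```r
# First ten rows of three columns, copied from the str() output above
predictors <- na.omit(data.frame(
  fiber  = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  potass = c(280, 135, 320, 330, NA, 70, 30, 100, 125, 190),
  carbo  = c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13)
))

# VIFs as the diagonal of the inverse correlation matrix
vif <- diag(solve(cor(predictors)))
print(round(vif, 2))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign that the variable is largely explained by the others and is a candidate for removal.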

iii. How would the correlations change if we normalized the data first?

Correlation Matrix

FALSE             calories     protein           fat        sodium       sugars
FALSE calories  1.00000000  0.03399166  0.5073732397  0.2962474981  0.569120535
FALSE protein   0.03399166  1.00000000  0.2023533963  0.0115588913 -0.286583967
FALSE fat       0.50737324  0.20235340  1.0000000000  0.0008219036  0.287152487
FALSE sodium    0.29624750  0.01155889  0.0008219036  1.0000000000  0.037058961
FALSE sugars    0.56912054 -0.28658397  0.2871524866  0.0370589612  1.000000000
FALSE potass   -0.07136125  0.57874284  0.1996367171 -0.0394380876  0.001413982
FALSE weight    0.69645215  0.23067141  0.2217141647  0.3125335701  0.460547135
FALSE cups      0.08919615 -0.24209861 -0.1575787041  0.1195841083 -0.032436100
FALSE vitamins  0.25984556  0.05479952 -0.0305139099  0.3315759640  0.072954382
FALSE fiber    -0.29521183  0.51400610  0.0140358654 -0.0707349230 -0.150948502
FALSE carbo     0.27060605 -0.03674326 -0.2849336855  0.3284091857 -0.452069189
FALSE                potass     weight        cups    vitamins       fiber
FALSE calories -0.071361247  0.6964521  0.08919615  0.25984556 -0.29521183
FALSE protein   0.578742837  0.2306714 -0.24209861  0.05479952  0.51400610
FALSE fat       0.199636717  0.2217142 -0.15757870 -0.03051391  0.01403587
FALSE sodium   -0.039438088  0.3125336  0.11958411  0.33157596 -0.07073492
FALSE sugars    0.001413982  0.4605471 -0.03243610  0.07295438 -0.15094850
FALSE potass    1.000000000  0.4205615 -0.50168832 -0.00263583  0.91150392
FALSE weight    0.420561534  1.0000000 -0.20171465  0.32043480  0.24629218
FALSE cups     -0.501688318 -0.2017146  1.00000000  0.13362965 -0.51369716
FALSE vitamins -0.002635830  0.3204348  0.13362965  1.00000000 -0.03871734
FALSE fiber     0.911503921  0.2462922 -0.51369716 -0.03871734  1.00000000
FALSE carbo    -0.365002934  0.1448053  0.35828371  0.25357897 -0.37908370
FALSE                carbo
FALSE calories  0.27060605
FALSE protein  -0.03674326
FALSE fat      -0.28493369
FALSE sodium    0.32840919
FALSE sugars   -0.45206919
FALSE potass   -0.36500293
FALSE weight    0.14480528
FALSE cups      0.35828371
FALSE vitamins  0.25357897
FALSE fiber    -0.37908370
FALSE carbo     1.00000000

Nothing changes when we normalize the data before computing the correlation matrices and plots, since the correlation coefficient already standardizes each variable.
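This invariance is easy to verify: the Pearson correlation is unchanged by any positive linear rescaling of a variable, which covers both z-score and min-max normalization. The sketch below checks it on the first ten calories, fiber and carbo values from the str() output; the object names are introduced here for illustration.

```r
# First ten rows of three columns, copied from the str() output above
x <- data.frame(
  calories = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90),
  fiber    = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  carbo    = c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13)
)

# z-score normalization: center to mean 0, scale to standard deviation 1
x_z <- scale(x)

# min-max normalization: rescale each column to the range [0, 1]
x_minmax <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

# Both comparisons print TRUE
print(isTRUE(all.equal(cor(x), cor(x_z))))
print(isTRUE(all.equal(cor(x), cor(x_minmax))))
```

Normalization still matters for methods that depend on the scale of the variables, such as covariance-based PCA or distance-based clustering.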